Spike a /distributions endpoint in spec to reduce /version size #441
So on your points:
Also I don't see an issue with also having a
So while I get that the response is long, I don't think it's too messy. At most we get code which looks like:
edition.distributions[0].table_schema.columns
// Or including embedded
edition._embedded.distributions[0].table_schema.columns
So there's a bit of nesting there, but I'd say that feels perfectly pleasant to use and navigate with. Personally I'd be a bit more annoyed with having to make another call.
What we have shapes up into the sort of thing Google are looking for as part of their structured data SEO... they even go as far as throwing in the observations.
Another thought is, many users won't really appreciate the difference between a dataset and its distributions. Many analysts will say things like "a dataset has columns", which isn't how DCAT models it. So I don't think including a distribution along with its edition/version is a bad thing.
Assuming a structure roughly like this:
curl https://data.ons.gov.uk/datasets/cpih/editions/2022-01
{
"@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01",
"@type": "dcat:Dataset",
// some additional stuff here...
"distributions": [
{
"@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01.csv",
"@type": "dcat:Distribution",
"byte_size": 123456,
"download_url": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01.csv",
"media_type": "text/csv",
"table_schema": {
"columns": [
{
"name": "area",
"titles": "area",
"datatype": "string",
"label": "Area",
"description": "The area of an observation."
},
{
"name": "period",
"titles": "period",
"datatype": "string",
"label": "Period",
"description": "The period of an observation."
},
{
"name": "sex",
"titles": "sex",
"datatype": "string",
"label": "Sex",
"description": "Biological sex of observed individuals."
},
{
"name": "life_expectancy",
"titles": "life_expectancy",
"datatype": "decimal",
"label": "Average life expectancy",
"description": "Mean life expectancy of observed individuals."
}
]
}
},
{
"@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01.xlsx",
"@type": "dcat:Distribution",
"byte_size": 123456,
"download_url": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01.xlsx",
"media_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
}
],
"versions": [
{
"@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01/versions/1",
"issued": "2022-01-01T00:00:00Z",
"modified": "2022-01-01T00:00:00Z"
},
{
"@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01/versions/2",
"issued": "2022-01-02T00:00:00Z",
"modified": "2022-01-02T00:00:00Z"
}
]
}
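To make the "bit of nesting" concrete, here is a minimal Python sketch of a client reading column names out of an edition response shaped like the example above. The helper names `get_distributions` and `column_names` are hypothetical, the sample data is a trimmed copy of the example, and the `_embedded` fallback assumes the HAL-style variant mentioned in the comment. None of this is taken from the spec itself.

```python
# Trimmed copy of the example edition response (assumed shape, not the real spec).
edition = {
    "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01",
    "@type": "dcat:Dataset",
    "distributions": [
        {
            "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01.csv",
            "media_type": "text/csv",
            "table_schema": {
                "columns": [
                    {"name": "area", "datatype": "string"},
                    {"name": "period", "datatype": "string"},
                    {"name": "sex", "datatype": "string"},
                    {"name": "life_expectancy", "datatype": "decimal"},
                ]
            },
        },
        {
            "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01.xlsx",
            "media_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        },
    ],
}


def get_distributions(resource):
    """Return distributions from the top level or, failing that, from _embedded."""
    if "distributions" in resource:
        return resource["distributions"]
    return resource.get("_embedded", {}).get("distributions", [])


def column_names(resource, media_type="text/csv"):
    """Column names of the first matching distribution that carries a table_schema."""
    for dist in get_distributions(resource):
        if dist.get("media_type") == media_type and "table_schema" in dist:
            return [col["name"] for col in dist["table_schema"]["columns"]]
    return []  # no distribution of that type describes its columns


if __name__ == "__main__":
    print(column_names(edition))
```

The same call works whether the distributions sit at the top level or under `_embedded`, which is the point being argued: one request, one short traversal.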
What
Fundamentally, we face an issue with the complexity of returning dcat:dataset and dcat:distribution details as one joined response while meeting API user expectations. The issue is the nested table_schema within the distribution within the dataset. Having three layers of nested documents on /editions/ makes for a very messy API, and suggests that at least one of those nested resources is a sub-resource of another. As dcat:distribution is a known resource type, this is an attempt to see if the API could mirror that distinction while still providing utility for users.
This means /versions/ has no top level distribution or table_schema object; this is instead embedded like this:
The dimensions list is there for census reasons, relocated from the previous top level field.
But that more complete view of distributions and table_schema for CSV distros is available on a /distributions endpoint with this response type:
{
  "_links": {
    "self": { "href": "string" },
    "next": { "href": "string" },
    "prev": { "href": "string" }
  },
  "count": 0,
  "limit": 0,
  "offset": 0,
  "total_count": 0,
  "@context": "string",
  "items": [
    {
      "checksum": "string",
      "described_by": "string",
      "table_schema": {
        "about_url": "string",
        "column": [
          {
            "component_type": "string",
            "datatype": "string",
            "name": "string",
            "title": "string"
          }
        ]
      },
      "@id": "string",
      "byte_size": "string",
      "media_type": "string",
      "download_url": "string",
      "etag": "string",
      "@type": "dcat:distribution",
      "_links": {
        "self": { "href": "string" },
        "version": { "href": "string", "id": "string" }
      }
    }
  ]
}
How to review
There are 3 main issues with this approach that I think I need feedback on:
dcat:dataset resource being represented (/editions/ embeds versions and distributions, /versions/ embeds dimensions and distributions - just based on my best guess of what would be useful). On the one hand we said the HAL fields _embedded and _links are generated and so it's acceptable for these to change between endpoints, but if they're forming part of the dcat resource is that still the case?
Who can review
@janderson2 @rossbowen
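
For reviewers weighing the cost of the extra call, here is a rough Python sketch of how a client might page through the proposed /distributions response and collect CSV columns per version. It assumes `offset`/`limit` paging driven by `count` and `total_count`, and it uses the singular `column` key from the response type above. The `fetch_page` stub, the sample items, and the version hrefs are hypothetical stand-ins for a real HTTP call, and the query parameter names are not confirmed by the spec.

```python
# Hypothetical items shaped like entries of "items" in the /distributions response.
FAKE_ITEMS = [
    {
        "media_type": "text/csv",
        "table_schema": {"column": [{"name": "area"}, {"name": "period"}]},
        "_links": {"version": {"href": "/datasets/cpih/editions/2022-01/versions/1", "id": "1"}},
    },
    {
        "media_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "_links": {"version": {"href": "/datasets/cpih/editions/2022-01/versions/1", "id": "1"}},
    },
    {
        "media_type": "text/csv",
        "table_schema": {"column": [{"name": "area"}, {"name": "period"}, {"name": "sex"}]},
        "_links": {"version": {"href": "/datasets/cpih/editions/2022-01/versions/2", "id": "2"}},
    },
]


def fetch_page(offset, limit):
    """Stand-in for GET /distributions with offset and limit (assumed parameters)."""
    items = FAKE_ITEMS[offset:offset + limit]
    return {
        "count": len(items),
        "limit": limit,
        "offset": offset,
        "total_count": len(FAKE_ITEMS),
        "items": items,
    }


def iter_distributions(fetch, limit=2):
    """Yield every distribution, requesting pages until total_count is reached."""
    offset = 0
    while True:
        page = fetch(offset, limit)
        yield from page["items"]
        offset += page["count"]
        # Stop on an empty page as well, so a bad total_count cannot loop forever.
        if page["count"] == 0 or offset >= page["total_count"]:
            break


def csv_columns_by_version(fetch):
    """Map each version href to the column names of its CSV distribution."""
    result = {}
    for dist in iter_distributions(fetch):
        if dist.get("media_type") != "text/csv":
            continue  # only CSV distributions carry a table_schema in this proposal
        version_href = dist["_links"]["version"]["href"]
        result[version_href] = [col["name"] for col in dist["table_schema"]["column"]]
    return result


if __name__ == "__main__":
    print(csv_columns_by_version(fetch_page))
```

Compared with reading `table_schema` straight off an edition or version, the client here has to page, filter by media type, and join back to a version through `_links`. That is the trade-off the review questions are asking about.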